Predictive analytics: Poor outcome

Thoughts

Questions

  • What deficit are we talking about? Is it covered by the mRS?
  • Should the pre-treatment mRS be in the model? Already covered by the others?

Backlog

  • Honest estimation
  • G-estimation (refit using only the significant predictors to obtain accurate ORs)
  • Check that the number of outcome events (26) supports the number of candidate predictors.
  • Bootstrap the whole process - how many models include each of the features?
  • The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
  • Choose the best model with forward/backward stepwise selection
  • Feature selection with honest estimation + p-value-based + bootstrap
  • Feature selection with honest estimation + LASSO-based + bootstrap
  • Use PCA to create uncorrelated predictors instead of dropping variables
  • Bayesian fitting
  • Change to use tidybayes and tidymodels
  • Use neural networks to fit
  • Consider a time-to-event analysis
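One backlog item above — bootstrapping the selection step and counting how often each feature is retained — can be sketched on simulated data. The p < 0.05 retention rule, the simulated predictors, and the effect size are illustrative assumptions, not the project's final pipeline (which might use LASSO via glmnet instead):

```r
# Bootstrap feature-selection stability: refit on resamples and count
# how many models retain each feature. Simulated data, base R only.
set.seed(1891)
n <- 300
df <- data.frame(
  x1 = rnorm(n),  # truly associated with the outcome
  x2 = rnorm(n),  # pure noise
  x3 = rnorm(n)   # pure noise
)
df$y <- rbinom(n, 1, plogis(-1 + 1.5 * df$x1))

selection_frequency <- function(df, n_boot = 100) {
  hits <- c(x1 = 0, x2 = 0, x3 = 0)
  for (b in seq_len(n_boot)) {
    boot <- df[sample(nrow(df), replace = TRUE), ]
    fit <- glm(y ~ x1 + x2 + x3, data = boot, family = binomial())
    p <- summary(fit)$coefficients[-1, "Pr(>|z|)"]  # drop the intercept row
    hits <- hits + (p < 0.05)
  }
  hits / n_boot  # proportion of bootstrap models retaining each feature
}

freqs <- selection_frequency(df)
freqs  # x1 should be retained almost always; x2 and x3 only rarely
```

Features retained in nearly every resample are stable; features that appear only sporadically are likely noise.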

Setup

Imports

# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)

# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true

# Source code

Configurations

# Source outcome-specific configs
source(params$CONFIG)

# Destination locations
DST_DIRNAME <- "../../outputs/"

# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"

# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE

# Set the seed
set.seed(1891)

Parameters

EXPOSURES_CONTINUOUS <- c(
  "Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
  "mRS (presentation)" = "modified_rankin_score_presentation",
  "mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
  "mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
  "mRS (final)" = "modified_rankin_score_final",
  "Nidus size (cm)" = "max_size_cm",
  "Spetzler-Martin grade" = "spetzler_martin_grade",
  "Lawton-Young grade" = "lawton_young_grade",
  "Size score" = "size_score"
)

EXPOSURES_BINARY <- c(
  "Sex (Male)" = "is_male",
  "Involves eloquent location" = "is_eloquent_location",
  "Has deep venous drainage" = "has_deep_venous_drainage",
  "Diffuse nidus" = "is_diffuse_nidus",
  "Hemorrhage at presentation" = "has_hemorrhage",
  "Seizures at presentation" = "has_seizures",
  "Deficit at presentation" = "has_deficit",
  "Paresis at presentation" = "has_paresis",
  "Presence of aneurysms" = "has_associated_aneurysm",
  "Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
  "Had surgery" = "is_surgery"
)

EXPOSURES_CATEGORICAL <- c(
  "Location" = "location",
  "Venous drainage" = "venous_drainage",
  "Modality of treatment" = "procedure_combinations"
)

OUTCOMES <- c(
  "Poor outcome (mRS >= 3)" = "is_poor_outcome",
  "Obliteration" = "is_obliterated",
  "Complications - minor" = "has_complication_minor",
  "Complications - major" = "has_complication_major",
  "mRS change (final - pre-treatment)" =
    "modified_rankin_score_final_minus_pretx",
  "mRS change (final - presentation)" =
    "modified_rankin_score_final_minus_presentation",
  "mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)

Outputs

Create lists.

plots <- list()
tables <- list()
models <- list()

Functions

R utils.

Data analysis utils.


Read

# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)

# Read
patients_ <-
  read_csv(filepath) %>%
  dplyr::sample_frac(params$PROPORTION_OF_DATA)

# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
  patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
  patients_ <- patients_ %>% filter(!has_hemorrhage)
}

Conform

Setup

Remove rows with no patient ID, then sort by patient ID.

patients_ <-
  patients_ %>%
  filter(!is.na(patient_id)) %>%
  arrange(patient_id)

Create two dataframes: one for univariable and one for multivariable analysis. Variables in the multivariable dataframe are recoded to reduce the number of factor levels and limit overfitting.
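A quick sanity check on model size, using the common ~10 events-per-variable heuristic for logistic regression (a rule of thumb assumed here, not a project-specific threshold), with the 26 outcome events noted in the backlog:

```r
# Events-per-variable (EPV) check; ~10 EPV is an assumed rule of thumb
n_events <- 26  # poor-outcome events in the cohort
epv <- 10
max_predictors <- floor(n_events / epv)
max_predictors  # ~2 candidate predictors before overfitting risk rises
```

This is why the multivariable dataframe collapses factor levels: each extra level costs a degree of freedom against a small event count.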

df_uni <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))

df_multi <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))

Recode columns

For venous_drainage, recode “Both” as “Deep,” since combined drainage includes a deep component.

df_multi <-
  df_multi %>%
  mutate(
    venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
  )

For procedure_combinations, recode combinations of more than one procedure (codes longer than one character) as “Multimodal” to reduce the number of levels.

df_multi <-
  df_multi %>%
  mutate(procedure_combinations = case_when(
    nchar(procedure_combinations) > 1 ~ "Multimodal",
    .default = procedure_combinations
  ))
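A toy check of the recode above, assuming procedure codes are single letters (e.g. “S” for surgery) concatenated per patient; written in base R so it runs standalone:

```r
# Multi-character combinations become "Multimodal"; single codes pass through.
# The example codes are assumptions, not values from the dataset.
pc <- c("S", "SE", "R", "SER")
recoded <- ifelse(nchar(pc) > 1, "Multimodal", pc)
recoded  # "S" "Multimodal" "R" "Multimodal"
```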

For procedure_combinations, indicate if surgery-based or not.

df_multi <-
  df_multi %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

df_uni <-
  df_uni %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

For location, collapse infrequent levels into “Other” and set “Other” as the reference level.

df_multi <-
  df_multi %>%
  mutate(
    location = ifelse(location == "Corpus Callosum", "Other", location),
    location = ifelse(location == "Cerebellum", "Other", location),
    # location = ifelse(location == "Deep", "Other", location),
    location = factor(location),
    location = relevel(location, ref = "Other")
  )

For spetzler_martin_grade, create a binary variant of 1-3 vs. 4-5.

# For multivariable analysis
df_multi <-
  df_multi %>%
  mutate(
    # Divides into low grade vs. high grade
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

# For univariable analysis
df_uni <-
  df_uni %>%
  mutate(
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

Missingness

# Get cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
))

# Count missingness
df_multi %>%
  select(all_of(cols)) %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(everything(), values_to = "num_missing") %>%
  arrange(desc(num_missing)) %>%
  filter(num_missing > 0) %>%
  sable()
name num_missing
modified_rankin_score_pretreatment 2
modified_rankin_score_postop_within_1_week 2
lawton_young_grade 2
has_associated_aneurysm 2
is_surgery 2
procedure_combinations 2
modified_rankin_score_final_minus_pretx 2
is_male 1
is_diffuse_nidus 1
location 1
has_complication_minor 1
has_complication_major 1

Which eligible patients are missing each variable?

df_multi %>%
  filter(is_eligible) %>%
  select(mrn, all_of(cols)) %>%
  mutate(across(-mrn, is.na)) %>%
  pivot_longer(
    cols = -mrn,
    names_to = "name",
    values_to = "is_missing"
  ) %>%
  filter(is_missing) %>%
  group_by(name) %>%
  summarize(mrn = str_c(mrn, collapse = ", ")) %>%
  sable()
name mrn
has_associated_aneurysm 60310463, 61011961
has_complication_major 62023254
has_complication_minor 62023254
is_diffuse_nidus 15871379
is_male 19399469
is_surgery 49659493, 46166047
lawton_young_grade 60310463, 61011961
location 60310463
modified_rankin_score_final_minus_pretx 49659493, 46166047
modified_rankin_score_postop_within_1_week 49659493, 46166047
modified_rankin_score_pretreatment 49659493, 46166047
procedure_combinations 49659493, 46166047

Visualizations

Overall

Distribution of values of the exposures across levels of the outcome.

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = .data[[CHOSEN_OUTCOME]],
    color = .data[[CHOSEN_OUTCOME]]
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", fill = "Outcome") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11),
    legend.position = "none"
  )

By hemorrhage

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Hemorrhage at presentation`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Hemorrhage at presentation`,
    color = `Hemorrhage at presentation`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )

By SM grade high vs. low

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Spetzler-Martin grade < 4`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Spetzler-Martin grade < 4`,
    color = `Spetzler-Martin grade < 4`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Spetzler-Martin grade < 4") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )


Descriptive statistics

Setup

Descriptive statistics - continuous variables.

compute_descriptive_continuous <- function(
    df = df_uni,
    cols = c(EXPOSURES_CONTINUOUS),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Apply desired formatting to numbers
  format_numbers <- function(x) {
    sprintf("%.1f", x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      mean = mean(values, na.rm = TRUE),
      sd = sd(values, na.rm = TRUE),
      min = quantile(values, 0, na.rm = TRUE),
      max = quantile(values, 1, na.rm = TRUE),
      q_50 = quantile(values, 0.50, na.rm = TRUE),
      q_25 = quantile(values, 0.25, na.rm = TRUE),
      q_75 = quantile(values, 0.75, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    # Order columns as in compute_descriptive_binary so the rename below
    # pairs each label with the correct column
    relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(keys) %>%
    summarize(
      pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Add sample size to column names
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Descriptive statistics - binary variables.

compute_descriptive_binary <- function(
    df = df_uni,
    cols = c(EXPOSURES_BINARY),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Drop the stratifying column from the predictor list
  cols <- cols[cols != interaction_col]

  # Apply desired formatting to numbers
  format_numbers <- function(x, decimals = 0) {
    decimals <- paste0("%.", decimals, "f")
    sprintf(decimals, x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(
      cols = -c(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      num_with = sum(values, na.rm = TRUE),
      num_without = sum(!values, na.rm = TRUE),
      pct_with = mean(values, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(
      cols = -c(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(keys) %>%
    summarize(
      pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Prettify
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Non-specific

# Rename dataframe
`Pediatric AVMs` <-
  df_uni %>%
  filter(is_eligible) %>%
  select(-is_eligible, -comments)

# Create summary
`Pediatric AVMs` %>%
  summarytools::dfSummary(display.labels = FALSE) %>%
  print(
    file =
      "../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
    footnote = NA
  )

# Remove unwanted dataframe
remove(`Pediatric AVMs`)

By hemorrhage - table1 (WIP)

https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html

library(table1)

table1::table1(
  ~ factor(is_male) + age_at_first_treatment_yrs +
    factor(is_eloquent_location) + max_size_cm | has_hemorrhage,
  data = df_uni
)

# Initialize values
df <- df_uni

# Remove missing values in stratification variables
df <-
  df %>%
  drop_na(has_hemorrhage, CHOSEN_OUTCOME)

# Define columns
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  OUTCOMES[OUTCOMES == CHOSEN_OUTCOME]
)

# Assign labels to the variables in the dataframe
for (var in cols) {
  label(df[[var]]) <- names(cols[cols == var])
}

# Assign meaningful labels to the has_hemorrhage variable
df$has_hemorrhage <- factor(
  df$has_hemorrhage,
  levels = c(FALSE, TRUE),
  labels = c("No Hemorrhage", "Has Hemorrhage")
)

df[, CHOSEN_OUTCOME] <- factor(
  df %>% pull(CHOSEN_OUTCOME),
  levels = CHOSEN_OUTCOME_LEVELS,
  labels = CHOSEN_OUTCOME_LABELS
)

# Define custom rendering functions if needed
render_continuous <- function(x) {
  with(stats.apply.rounding(stats.default(x), digits = 2), c(
    "Mean (SD)" = sprintf("%s (&plusmn; %s)", MEAN, SD),
    "Median (IQR)" = sprintf("%s (%s - %s)", MEDIAN, Q1, Q3)
  ))
}

compute_pvalues <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(1:length(x), times = sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    p <- t.test(y ~ g)$p.value
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}

# Create the stratification formula
fmla <- as.formula(paste("~ . | has_hemorrhage *", CHOSEN_OUTCOME))

# Create table
table1_descriptive_statistics <-
  table1(fmla,
    data = df[, unname(cols)],
    render.continuous = render_continuous,
    overall = c(left = "Overall"),
    topclass = "Rtable1-zebra"
  )

# Print
table1_descriptive_statistics
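The compute_pvalues helper above is defined but never attached to the table. table1 supports appended computed columns through its extra.col argument (the pattern used in the package vignette); a sketch of the wiring, assuming a simple two-level stratification since the two-sample tests require exactly two groups:

```r
# Hypothetical wiring of compute_pvalues into table1 via extra.col.
# The has_hemorrhage * outcome cross yields four strata, which would break
# the t-test / chi-squared helpers, so stratify on the outcome alone.
fmla_two_level <- as.formula(paste("~ . |", CHOSEN_OUTCOME))

table1(
  fmla_two_level,
  data = df[, unname(cols)],
  render.continuous = render_continuous,
  overall = FALSE,  # an Overall column would add a third group to the tests
  extra.col = list("P-value" = compute_pvalues),
  topclass = "Rtable1-zebra"
)
```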

By hemorrhage

if (!str_detect(params$TITLE, "Hemorrhage")) {
  
  # Define arguments
  interaction_col <- "has_hemorrhage"
  interaction_col_name <- "Hemorrhage at presentation"
  prefix_true <- "Haem"
  prefix_false <- "No Haem"
  
  # Get all
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = F,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = "All"
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = F,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = "All"
  )
  out_all <- bind_rows(out)
  
  # Get with hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = prefix_true
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = T,
    prefix = prefix_true
  )
  out_haem <- bind_rows(out)
  
  # Get without hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = F,
    prefix = prefix_false
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = T,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = F,
    prefix = prefix_false
  )
  out_no_haem <- bind_rows(out)
  
  # Synthesize
  out_complete <-
    out_all %>%
    left_join(out_haem, by = "keys") %>%
    left_join(out_no_haem, by = "keys") %>%
    arrange(`All - P-value`)
  
  # Print
  out_complete %>%
    sable()
  
}
keys All - Poor outcome (n=26) All - Good outcome (n=202) All - P-value Haem - Poor outcome (n=15) Haem - Good outcome (n=122) Haem - P-value No Haem - Poor outcome (n=11) No Haem - Good outcome (n=80) No Haem - P-value
mRS (final) 0.6 (0.7) 3.8 (1.1) 0.00000 0.7 (0.7) 3.9 (1.0) 0.00000 0.6 (0.7) 3.6 (1.2) 0.00000
mRS (1-week post-op) 1.4 (1.2) 3.5 (1.1) 0.00000 1.7 (1.2) 4.0 (0.8) 0.00000 1.0 (1.0) 2.9 (1.1) 0.00002
mRS (pre-treatment) 1.5 (1.4) 2.7 (1.5) 0.00015 1.9 (1.5) 3.3 (1.5) 0.00162 0.9 (0.8) 1.8 (1.3) 0.00887
Diffuse nidus 9 (35%) 22 (11%) 0.00097 4 (27%) 12 (10%) 0.05636 5 (45%) 10 (13%) 0.00654
Spetzler-Martin grade 2.8 (1.2) 3.6 (1.3) 0.00117 2.6 (1.1) 3.3 (1.3) 0.04461 3.0 (1.2) 4.1 (1.0) 0.00843
Spetzler-Martin grade < 4 11 (42%) 148 (73%) 0.00125 8 (53%) 98 (80%) 0.01881 3 (27%) 50 (62%) 0.02717
Involves eloquent location 23 (88%) 116 (57%) 0.00231 12 (80%) 64 (52%) 0.04360 11 (100%) 52 (65%) 0.01902
Lawton-Young grade 4.3 (1.5) 5.3 (1.8) 0.00525 3.8 (1.4) 4.5 (1.6) 0.07814 5.0 (1.5) 6.3 (1.5) 0.01344
Size score 1.6 (0.7) 2.0 (0.8) 0.01352 1.5 (0.6) 1.7 (0.8) 0.20772 1.9 (0.7) 2.5 (0.7) 0.01752
Nidus size (cm) 3.3 (2.1) 4.8 (3.1) 0.01682 2.9 (1.9) 3.9 (3.2) 0.33159 3.9 (2.4) 6.1 (2.6) 0.00541
Deficit at presentation 15 (58%) 70 (35%) 0.02251 6 (40%) 45 (37%) 0.81450 9 (82%) 25 (31%) 0.00123
Age at 1st treatment (years) 11.5 (3.7) 9.6 (4.3) 0.03478 11.2 (3.4) 9.1 (4.2) 0.06577 12.0 (4.2) 10.3 (4.6) 0.20988
mRS (presentation) 2.2 (1.8) 3.0 (1.6) 0.05041 3.0 (1.7) 3.7 (1.5) 0.24509 1.0 (0.9) 1.9 (1.1) 0.00561
Has deep venous drainage 17 (65%) 108 (53%) 0.25139 9 (60%) 73 (60%) 0.99028 8 (73%) 35 (44%) 0.07265
Paresis at presentation 9 (35%) 49 (24%) 0.25472 6 (40%) 34 (28%) 0.33126 3 (27%) 15 (19%) 0.50818
Presence of aneurysms 6 (23%) 40 (20%) 0.71455 4 (27%) 30 (25%) 0.86104 2 (18%) 10 (13%) 0.62793
Hemorrhage at presentation 15 (58%) 122 (60%) 0.79147 15 (100%) 122 (100%) 0 (0%) 0 (0%)
Seizures at presentation 6 (23%) 50 (25%) 0.85212 1 (7%) 28 (23%) 0.14661 5 (45%) 22 (28%) 0.22416
Had surgery 15 (58%) 118 (59%) 0.89880 9 (60%) 81 (68%) 0.56273 6 (55%) 37 (46%) 0.60735
Sex (Male) 14 (54%) 106 (53%) 0.91523 6 (40%) 67 (55%) 0.27623 8 (73%) 39 (49%) 0.14843

By diffuse nidus

# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out_haem <- bind_rows(out)

# Get non-diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out_no_haem <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_haem, by = "keys") %>%
  left_join(out_no_haem, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Poor outcome (n=26) All - Good outcome (n=201) All - P-value Diffuse - Poor outcome (n=9) Diffuse - Good outcome (n=22) Diffuse - P-value Not Diffuse - Poor outcome (n=17) Not Diffuse - Good outcome (n=179) Not Diffuse - P-value
mRS (final) 0.7 (0.7) 3.8 (1.1) 0.00000 1.0 (0.7) 4.0 (1.2) 0.00001 0.6 (0.7) 3.7 (1.0) 0.00000
mRS (1-week post-op) 1.4 (1.2) 3.5 (1.1) 0.00000 1.3 (0.9) 3.4 (1.1) 0.00033 1.4 (1.2) 3.6 (1.1) 0.00000
mRS (pre-treatment) 1.5 (1.4) 2.7 (1.5) 0.00015 1.3 (1.1) 2.9 (1.5) 0.01244 1.5 (1.4) 2.5 (1.6) 0.00473
Is diffuse nidus 9 (35%) 22 (11%) 0.00097 9 (100%) 22 (100%) 0 (0%) 0 (0%)
Spetzler-Martin grade 2.8 (1.2) 3.6 (1.3) 0.00124 4.3 (0.9) 4.4 (0.7) 0.70194 2.6 (1.1) 3.2 (1.3) 0.05491
Spetzler-Martin grade < 4 11 (42%) 147 (73%) 0.00133 1 (11%) 4 (18%) 0.63268 10 (59%) 143 (80%) 0.04544
Involves eloquent location 23 (88%) 115 (57%) 0.00218 9 (100%) 20 (91%) 0.35757 14 (82%) 95 (53%) 0.02055
Lawton-Young grade 4.3 (1.5) 5.3 (1.8) 0.00557 6.7 (1.2) 6.8 (1.1) 0.98188 4.0 (1.3) 4.5 (1.5) 0.19408
Size score 1.6 (0.7) 2.0 (0.8) 0.01433 2.5 (0.7) 2.6 (0.7) 0.63766 1.5 (0.6) 1.8 (0.8) 0.20482
Nidus size (cm) 3.3 (2.1) 4.8 (3.1) 0.01704 5.9 (3.0) 7.5 (3.1) 0.07645 3.0 (1.8) 3.4 (2.0) 0.39291
Deficit at presentation 15 (58%) 69 (34%) 0.02052 8 (89%) 10 (45%) 0.02864 7 (41%) 59 (33%) 0.49447
Age at 1st treatment (years) 11.5 (3.7) 9.6 (4.3) 0.03236 11.4 (3.7) 8.4 (3.1) 0.05347 11.5 (3.7) 10.2 (4.8) 0.37518
mRS (presentation) 2.2 (1.8) 3.0 (1.6) 0.05061 1.6 (1.4) 3.0 (1.6) 0.03688 2.3 (1.8) 2.9 (1.7) 0.15219
Paresis at presentation 9 (35%) 48 (24%) 0.23596 5 (56%) 7 (32%) 0.22567 4 (24%) 41 (23%) 0.95347
Has deep venous drainage 17 (65%) 108 (54%) 0.26204 7 (78%) 20 (91%) 0.33013 10 (59%) 88 (49%) 0.44760
Presence of aneurysms 6 (23%) 39 (20%) 0.67731 2 (22%) 8 (36%) 0.45199 4 (24%) 31 (18%) 0.53889
Hemorrhage at presentation 15 (58%) 122 (61%) 0.76874 4 (44%) 12 (55%) 0.61530 11 (65%) 110 (61%) 0.79249
Seizures at presentation 6 (23%) 50 (25%) 0.84167 4 (44%) 7 (32%) 0.51177 2 (12%) 43 (24%) 0.25202
Had surgery 15 (58%) 118 (59%) 0.87594 4 (44%) 11 (50%) 0.78225 11 (65%) 107 (60%) 0.73211
Sex (Male) 14 (54%) 106 (53%) 0.93532 6 (67%) 10 (45%) 0.29129 8 (47%) 96 (54%) 0.58825

By SM grade high vs. low

# Define arguments
interaction_col <- "is_spetzler_martin_grade_less_than_4"
interaction_col_name <- "Spetzler-Martin grade < 4"
prefix_true <- "Low grade"
prefix_false <- "High grade"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = F,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get low grade (SM < 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = T,
  prefix = prefix_true
)
out_haem <- bind_rows(out)

# Get high grade (SM >= 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = T,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = F,
  prefix = prefix_false
)
out_no_haem <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_haem, by = "keys") %>%
  left_join(out_no_haem, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Poor outcome (n=26) All - Good outcome (n=202) All - P-value Low grade - Poor outcome (n=11) Low grade - Good outcome (n=148) Low grade - P-value High grade - Poor outcome (n=15) High grade - Good outcome (n=54) High grade - P-value
mRS (final) 0.6 (0.7) 3.8 (1.1) 0.00000 0.6 (0.7) 3.5 (1.0) 0.00000 0.8 (0.7) 4.0 (1.1) 0.00000
mRS (1-week post-op) 1.4 (1.2) 3.5 (1.1) 0.00000 1.4 (1.2) 3.6 (0.8) 0.00000 1.4 (0.9) 3.5 (1.2) 0.00000
mRS (pre-treatment) 1.5 (1.4) 2.7 (1.5) 0.00015 1.5 (1.5) 2.7 (1.6) 0.01192 1.3 (1.0) 2.6 (1.5) 0.00387
Diffuse nidus 9 (35%) 22 (11%) 0.00097 1 (9%) 4 (3%) 0.24589 8 (53%) 18 (33%) 0.16038
Spetzler-Martin grade 2.8 (1.2) 3.6 (1.3) 0.00117 2.2 (0.8) 2.4 (0.8) 0.47514 4.3 (0.5) 4.5 (0.5) 0.12416
Spetzler-Martin grade < 4 11 (42%) 148 (73%) 0.00125 11 (100%) 148 (100%) 0 (0%) 0 (0%)
Involves eloquent location 23 (88%) 116 (57%) 0.00231 8 (73%) 65 (44%) 0.06518 15 (100%) 51 (94%) 0.35413
Lawton-Young grade 4.3 (1.5) 5.3 (1.8) 0.00525 3.6 (1.0) 3.6 (1.0) 0.79642 6.2 (1.0) 6.5 (1.1) 0.36782
Size score 1.6 (0.7) 2.0 (0.8) 0.01352 1.3 (0.5) 1.3 (0.5) 0.75197 2.5 (0.5) 2.6 (0.5) 0.42491
Nidus size (cm) 3.3 (2.1) 4.8 (3.1) 0.01682 2.5 (1.3) 2.1 (1.0) 0.26545 5.5 (2.5) 6.8 (2.5) 0.02799
Deficit at presentation 15 (58%) 70 (35%) 0.02251 5 (45%) 47 (32%) 0.35167 10 (67%) 23 (43%) 0.10117
Age at 1st treatment (years) 11.5 (3.7) 9.6 (4.3) 0.03478 11.5 (3.5) 10.4 (4.7) 0.53070 11.5 (4.2) 9.1 (4.1) 0.05221
mRS (presentation) 2.2 (1.8) 3.0 (1.6) 0.05041 2.4 (1.9) 3.4 (1.7) 0.10960 1.8 (1.5) 2.7 (1.6) 0.05260
Has deep venous drainage 17 (65%) 108 (53%) 0.25139 5 (45%) 63 (43%) 0.85234 12 (80%) 45 (83%) 0.76485
Paresis at presentation 9 (35%) 49 (24%) 0.25472 4 (36%) 34 (23%) 0.31656 5 (33%) 15 (28%) 0.67705
Presence of aneurysms 6 (23%) 40 (20%) 0.71455 2 (18%) 27 (18%) 0.97959 4 (27%) 13 (24%) 0.83785
Hemorrhage at presentation 15 (58%) 122 (60%) 0.79147 8 (73%) 98 (66%) 0.65952 7 (47%) 24 (44%) 0.87922
Seizures at presentation 6 (23%) 50 (25%) 0.85212 1 (9%) 35 (24%) 0.26719 5 (33%) 15 (28%) 0.67705
Had surgery 15 (58%) 118 (59%) 0.89880 7 (64%) 96 (66%) 0.88701 8 (53%) 22 (41%) 0.38758
Sex (Male) 14 (54%) 106 (53%) 0.91523 4 (36%) 79 (54%) 0.26710 10 (67%) 27 (50%) 0.25564

Associations

Correlation matrix.

# Cols
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
)

# Remove hemorrhage
if (str_detect(params$TITLE, "Hemorrhage")) {
  cols <- cols[cols != "has_hemorrhage"]
}
  
# Create new dataframe
df_ <-
  df_uni %>%
  select(all_of(cols))

# Convert the outcome variable to numeric for correlation calculation
df_ <-
  df_ %>%
  select(where(is.numeric), where(is.logical)) %>%
  mutate(across(everything(), as.numeric))

# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")

# Plot the correlation matrix
ggcorrplot::ggcorrplot(
  cor_matrix,
  method = "circle",
  lab = TRUE,
  lab_size = 2,
  colors = c("red", "white", "green4"),
  title = "Correlation Matrix",
  hc.order = TRUE
) +
  theme(
    axis.text.x = element_text(size = 8), # Adjust x-axis text size
    axis.text.y = element_text(size = 8) # Adjust y-axis text size
  )


Univariable statistics

Setup

Define function.

fit_model <- function(
    df = df_uni, cols, is_sandwich = TRUE) {
  # Initialize
  out <- list()

  for (i in seq_along(cols)) {
    # Create formula
    model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

    fit <- suppressWarnings(glm(
      model,
      data = df,
      family = binomial()
    ))

    if (is_sandwich) {
      # Calculate robust standard errors (sandwich)
      robust_se <- sandwich::vcovHC(fit, type = "HC0")
      # Compute robust confidence intervals
      fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
    } else {
      fit_results <- fit
    }

    # Summarize model coefficients
    fit_summary <-
      fit_results %>%
      # broom does not support exponentiation after coeftest so do it manually
      broom::tidy(conf.int = T) %>%
      mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
      arrange(term == "(Intercept)", p.value) %>%
      rename(odds_ratio = estimate, z_value = statistic)

    # Stylize and print
    stylized <-
      fit_summary %>%
      rename(
        "Predictors" = term,
        "Odds Ratios (OR)" = odds_ratio,
        "SE" = std.error,
        "Z-scores" = z_value,
        "P-values" = p.value,
        "CI (low)" = conf.low,
        "CI (high)" = conf.high,
      ) %>%
      mutate(
        `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
        `SE` = round(`SE`, 2),
        `CI (low)` = round(`CI (low)`, 2),
        `CI (high)` = round(`CI (high)`, 2),
        `P-values` = round(`P-values`, 3),
      )

    out[[i]] <- stylized
  }
  return(out)
}
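The adjusted analyses below repeat this loop inline with extra covariates pasted into the formula. One way to fold that into the helper would be an optional adjustment set; a sketch only, where `adjustors` is a hypothetical argument not part of the current pipeline:

```r
# Sketch: a variant of fit_model that accepts an optional adjustment set
# (`adjustors` is hypothetical; the adjusted sections below inline the loop)
fit_model_adjusted <- function(df, cols, adjustors = character(0)) {
  rhs_prefix <- if (length(adjustors) > 0) {
    paste0(paste(adjustors, collapse = " + "), " + ")
  } else {
    ""
  }
  lapply(cols, function(col) {
    frm <- as.formula(paste0(CHOSEN_OUTCOME, " ~ ", rhs_prefix, col))
    glm(frm, data = df, family = binomial())
  })
}
```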

Unadjusted - Original

Fit logistic.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = TRUE)

# Create table
univariable_unadjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.63 0.12 3.96650 0.000 1.28 2.07
modified_rankin_score_postop_within_1_week 3.53 0.20 6.44009 0.000 2.40 5.18
locationCorpus Callosum 0.00 0.65 -24.88313 0.000 0.00 0.00
locationOther 0.00 0.90 -17.89582 0.000 0.00 0.00
is_diffuse_nidusTRUE 4.31 0.47 3.10660 0.002 1.71 10.82
is_spetzler_martin_grade_less_than_4TRUE 0.27 0.43 -3.08314 0.002 0.12 0.62
spetzler_martin_grade 1.83 0.20 2.97104 0.003 1.23 2.72
lawton_young_grade 1.47 0.14 2.76206 0.006 1.12 1.92
is_eloquent_locationTRUE 5.68 0.63 2.75760 0.006 1.65 19.54
max_size_cm 1.25 0.09 2.57278 0.010 1.06 1.49
size_score 2.04 0.29 2.48577 0.013 1.16 3.59
has_deficitTRUE 2.57 0.42 2.22961 0.026 1.12 5.90
age_at_first_treatment_yrs 0.88 0.06 -2.19006 0.029 0.79 0.99
modified_rankin_score_presentation 1.25 0.10 2.18077 0.029 1.02 1.54
locationHemispheric 0.33 0.65 -1.67900 0.093 0.09 1.20
has_deep_venous_drainageTRUE 1.64 0.44 1.14105 0.254 0.70 3.86
has_paresisTRUE 1.65 0.44 1.13279 0.257 0.69 3.94
venous_drainageSuperficial 0.54 0.56 -1.10359 0.270 0.18 1.62
procedure_combinationsES 0.24 1.33 -1.06205 0.288 0.02 3.31
procedure_combinationsR 0.40 1.23 -0.74199 0.458 0.04 4.50
procedure_combinationsS 0.43 1.21 -0.68657 0.492 0.04 4.69
procedure_combinationsRS 0.47 1.34 -0.56046 0.575 0.03 6.57
procedure_combinationsER 0.63 1.20 -0.38255 0.702 0.06 6.65
has_associated_aneurysmTRUE 1.20 0.50 0.36617 0.714 0.45 3.18
venous_drainageDeep 0.85 0.55 -0.28807 0.773 0.29 2.50
has_hemorrhageTRUE 0.89 0.42 -0.26488 0.791 0.39 2.05
locationDeep 1.15 0.63 0.21637 0.829 0.33 3.97
has_seizuresTRUE 0.91 0.49 -0.18677 0.852 0.35 2.40
is_surgeryTRUE 0.95 0.42 -0.12746 0.899 0.41 2.17
is_maleTRUE 1.05 0.42 0.10667 0.915 0.46 2.37
procedure_combinationsERS 1.09 1.21 0.07196 0.943 0.10 11.67

Adjusted - Original

Fit logistic.

cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_uni

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.58 0.12 3.71816 0.000 1.24 2.01
modified_rankin_score_postop_within_1_week 3.75 0.25 5.38436 0.000 2.32 6.07
locationCorpus Callosum 0.00 0.62 -25.79715 0.000 0.00 0.00
locationOther 0.00 1.18 -13.40346 0.000 0.00 0.00
spetzler_martin_grade 1.85 0.19 3.19468 0.001 1.27 2.70
lawton_young_grade 1.47 0.13 2.94522 0.003 1.14 1.89
is_eloquent_locationTRUE 6.02 0.60 2.96695 0.003 1.84 19.69
is_diffuse_nidusTRUE 4.04 0.47 2.98448 0.003 1.61 10.09
is_spetzler_martin_grade_less_than_4TRUE 0.28 0.44 -2.90994 0.004 0.12 0.66
size_score 2.11 0.29 2.61905 0.009 1.21 3.69
max_size_cm 1.24 0.09 2.37080 0.018 1.04 1.47
has_deficitTRUE 2.35 0.43 1.98541 0.047 1.01 5.45
modified_rankin_score_presentation 1.20 0.10 1.80312 0.071 0.98 1.48
locationHemispheric 0.35 0.63 -1.65627 0.098 0.10 1.21
has_deep_venous_drainageTRUE 1.83 0.45 1.34464 0.179 0.76 4.39
venous_drainageSuperficial 0.49 0.57 -1.23534 0.217 0.16 1.51
procedure_combinationsES 0.22 1.44 -1.05551 0.291 0.01 3.68
procedure_combinationsS 0.36 1.31 -0.77649 0.437 0.03 4.69
procedure_combinationsR 0.36 1.34 -0.76397 0.445 0.03 4.97
has_paresisTRUE 1.39 0.47 0.70718 0.479 0.56 3.50
procedure_combinationsRS 0.38 1.38 -0.69923 0.484 0.03 5.70
has_associated_aneurysmTRUE 1.29 0.51 0.49767 0.619 0.48 3.46
has_hemorrhageTRUE 0.83 0.43 -0.43565 0.663 0.36 1.93
locationDeep 1.22 0.61 0.32951 0.742 0.37 4.07
procedure_combinationsER 0.65 1.32 -0.32299 0.747 0.05 8.74
is_surgeryTRUE 0.88 0.43 -0.30564 0.760 0.38 2.02
venous_drainageDeep 0.87 0.55 -0.24916 0.803 0.30 2.55
has_seizuresTRUE 0.89 0.50 -0.22783 0.820 0.34 2.37
procedure_combinationsERS 1.08 1.33 0.05904 0.953 0.08 14.61

Unadjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_unadjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.63 0.12 3.96650 0.000 1.28 2.07
modified_rankin_score_postop_within_1_week 3.53 0.20 6.44009 0.000 2.40 5.18
is_diffuse_nidusTRUE 4.31 0.47 3.10660 0.002 1.71 10.82
is_spetzler_martin_grade_less_than_4TRUE 0.27 0.43 -3.08314 0.002 0.12 0.62
spetzler_martin_grade 1.83 0.20 2.97104 0.003 1.23 2.72
lawton_young_grade 1.47 0.14 2.76206 0.006 1.12 1.92
is_eloquent_locationTRUE 5.68 0.63 2.75760 0.006 1.65 19.54
max_size_cm 1.25 0.09 2.57278 0.010 1.06 1.49
size_score 2.04 0.29 2.48577 0.013 1.16 3.59
has_deficitTRUE 2.57 0.42 2.22961 0.026 1.12 5.90
age_at_first_treatment_yrs 0.88 0.06 -2.19006 0.029 0.79 0.99
modified_rankin_score_presentation 1.25 0.10 2.18077 0.029 1.02 1.54
venous_drainageSuperficial 0.60 0.44 -1.18677 0.235 0.25 1.40
has_deep_venous_drainageTRUE 1.64 0.44 1.14105 0.254 0.70 3.86
has_paresisTRUE 1.65 0.44 1.13279 0.257 0.69 3.94
locationDeep 1.85 0.62 0.99492 0.320 0.55 6.20
locationHemispheric 0.54 0.64 -0.97178 0.331 0.16 1.87
procedure_combinationsR 0.40 1.23 -0.74199 0.458 0.04 4.50
procedure_combinationsS 0.43 1.21 -0.68657 0.492 0.04 4.69
procedure_combinationsMultimodal 0.58 1.15 -0.47112 0.638 0.06 5.54
is_surgeryTRUE 0.80 0.53 -0.43177 0.666 0.28 2.23
has_associated_aneurysmTRUE 1.20 0.50 0.36617 0.714 0.45 3.18
has_hemorrhageTRUE 0.89 0.42 -0.26488 0.791 0.39 2.05
has_seizuresTRUE 0.91 0.49 -0.18677 0.852 0.35 2.40
is_maleTRUE 1.05 0.42 0.10667 0.915 0.46 2.37

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.58 0.12 3.71816 0.000 1.24 2.01
modified_rankin_score_postop_within_1_week 3.75 0.25 5.38436 0.000 2.32 6.07
spetzler_martin_grade 1.85 0.19 3.19468 0.001 1.27 2.70
lawton_young_grade 1.47 0.13 2.94522 0.003 1.14 1.89
is_eloquent_locationTRUE 6.02 0.60 2.96695 0.003 1.84 19.69
is_diffuse_nidusTRUE 4.04 0.47 2.98448 0.003 1.61 10.09
is_spetzler_martin_grade_less_than_4TRUE 0.28 0.44 -2.90994 0.004 0.12 0.66
size_score 2.11 0.29 2.61905 0.009 1.21 3.69
max_size_cm 1.24 0.09 2.37080 0.018 1.04 1.47
has_deficitTRUE 2.35 0.43 1.98541 0.047 1.01 5.45
modified_rankin_score_presentation 1.20 0.10 1.80312 0.071 0.98 1.48
venous_drainageSuperficial 0.54 0.45 -1.37472 0.169 0.22 1.30
has_deep_venous_drainageTRUE 1.83 0.45 1.34464 0.179 0.76 4.39
locationDeep 1.92 0.60 1.08512 0.278 0.59 6.24
locationHemispheric 0.56 0.62 -0.95108 0.342 0.17 1.86
procedure_combinationsS 0.37 1.30 -0.76588 0.444 0.03 4.71
procedure_combinationsR 0.36 1.33 -0.75777 0.449 0.03 4.96
has_paresisTRUE 1.39 0.47 0.70718 0.479 0.56 3.50
is_surgeryTRUE 0.71 0.53 -0.63102 0.528 0.25 2.03
has_associated_aneurysmTRUE 1.29 0.51 0.49767 0.619 0.48 3.46
procedure_combinationsMultimodal 0.56 1.26 -0.45818 0.647 0.05 6.63
has_hemorrhageTRUE 0.83 0.43 -0.43565 0.663 0.36 1.93
has_seizuresTRUE 0.89 0.50 -0.22783 0.820 0.34 2.37

Interaction analysis

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]

adjustors <- c(
  "age_at_first_treatment_yrs",
  "is_male"
)

# Initialize
out <- list()
df <- df_multi
k <- 1

for (i in seq_along(cols)) {
  # Do not fit model for variables that we are adjusting for
  if (cols[i] %in% adjustors) {
    next
  }

  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i],
    "* has_hemorrhage"
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[k]] <- stylized
  k <- k + 1
}

Print all results.

# Define values
unwanted <- c(
  "is_maleTRUE",
  "age_at_first_treatment_yrs",
  "(Intercept)",
  "has_hemorrhageTRUE"
)

# Initialize objects
isolated_values <- list()

# Get main and interaction effects
for (i in seq_along(out)) {
  # Get data
  result <- out[[i]]

  # Get all main effects of interest
  main_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(!str_detect(Predictors, ":")) %>%
    select(-"SE", -"Z-scores")

  # Get the interaction effects
  interaction_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(str_detect(Predictors, ":")) %>%
    mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
    select(-"SE", -"Z-scores") %>%
    rename_with(~ paste("Interaction -", .), -Predictors)

  isolated_values[[i]] <-
    main_effects %>%
    left_join(interaction_effects, by = "Predictors")
}

Create and print the table.

# Create table
univariable_adjusted_recoded_interactions <-
  isolated_values %>%
  bind_rows() %>%
  arrange(`Interaction - P-values`)

# Print
univariable_adjusted_recoded_interactions %>%
  sable()
Predictors Odds Ratios (OR) P-values CI (low) CI (high) Interaction - Odds Ratios (OR) Interaction - P-values Interaction - CI (low) Interaction - CI (high)
is_eloquent_locationTRUE 2.1328e+07 0.000 8910800.83 5.1051e+07 0.00 0.000 0.00 0.00
procedure_combinationsMultimodal 8.1716e+05 0.000 87705.28 7.6136e+06 0.00 0.000 0.00 0.00
procedure_combinationsR 8.0232e+05 0.000 74374.66 8.6552e+06 0.00 0.000 0.00 0.00
procedure_combinationsS 7.9518e+05 0.000 33871.25 1.8668e+07 0.00 0.000 0.00 0.00
has_seizuresTRUE 2.2700e+00 0.227 0.60 8.6300e+00 0.10 0.063 0.01 1.13
has_deficitTRUE 7.1900e+00 0.020 1.36 3.7890e+01 0.17 0.091 0.02 1.32
modified_rankin_score_presentation 2.0600e+00 0.009 1.20 3.5300e+00 0.60 0.112 0.32 1.13
venous_drainageSuperficial 2.6000e-01 0.060 0.06 1.0600e+00 3.36 0.174 0.59 19.31
has_deep_venous_drainageTRUE 3.7700e+00 0.067 0.91 1.5530e+01 0.31 0.186 0.05 1.77
modified_rankin_score_pretreatment 2.5700e+00 0.004 1.34 4.9300e+00 0.65 0.233 0.31 1.32
locationDeep 8.3000e-01 0.884 0.07 9.9500e+00 3.07 0.428 0.19 49.38
has_paresisTRUE 9.2000e-01 0.928 0.16 5.3400e+00 1.93 0.539 0.24 15.66
lawton_young_grade 1.7200e+00 0.022 1.08 2.7300e+00 0.83 0.545 0.46 1.50
size_score 2.7400e+00 0.036 1.07 7.0300e+00 0.70 0.578 0.20 2.44
modified_rankin_score_postop_within_1_week 4.1100e+00 0.001 1.83 9.2000e+00 1.27 0.625 0.49 3.30
locationHemispheric 3.2000e-01 0.363 0.03 3.7700e+00 1.87 0.661 0.11 30.67
spetzler_martin_grade 2.1000e+00 0.020 1.12 3.9300e+00 0.85 0.700 0.38 1.90
is_diffuse_nidusTRUE 4.7200e+00 0.025 1.21 1.8390e+01 0.73 0.743 0.11 4.76
is_surgeryTRUE 9.9000e-01 0.997 0.09 1.1150e+01 0.69 0.784 0.05 10.13
has_associated_aneurysmTRUE 1.5000e+00 0.642 0.27 8.3500e+00 0.83 0.866 0.10 6.87
max_size_cm 1.2700e+00 0.111 0.95 1.6900e+00 0.97 0.880 0.66 1.42
is_spetzler_martin_grade_less_than_4TRUE 2.9000e-01 0.085 0.07 1.1900e+00 0.88 0.893 0.14 5.52

Eloquence.

df %>%
  count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, is_eloquent_location) %>%
  group_by(has_hemorrhage, is_eloquent_location) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage is_eloquent_location is_poor_outcome n num_total pct_total
FALSE FALSE FALSE 28 28 100%
FALSE TRUE FALSE 52 63 83%
FALSE TRUE TRUE 11 63 17%
TRUE FALSE FALSE 58 61 95%
TRUE FALSE TRUE 3 61 5%
TRUE TRUE FALSE 64 76 84%
TRUE TRUE TRUE 12 76 16%

Deficit.

df %>%
  count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_deficit) %>%
  group_by(has_hemorrhage, has_deficit) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_deficit is_poor_outcome n num_total pct_total
FALSE FALSE FALSE 55 57 96%
FALSE FALSE TRUE 2 57 4%
FALSE TRUE FALSE 25 34 74%
FALSE TRUE TRUE 9 34 26%
TRUE FALSE FALSE 77 86 90%
TRUE FALSE TRUE 9 86 10%
TRUE TRUE FALSE 45 51 88%
TRUE TRUE TRUE 6 51 12%

Seizures.

df %>%
  count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_seizures) %>%
  group_by(has_hemorrhage, has_seizures) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_seizures is_poor_outcome n num_total pct_total
FALSE FALSE FALSE 58 64 91%
FALSE FALSE TRUE 6 64 9%
FALSE TRUE FALSE 22 27 81%
FALSE TRUE TRUE 5 27 19%
TRUE FALSE FALSE 94 108 87%
TRUE FALSE TRUE 14 108 13%
TRUE TRUE FALSE 28 29 97%
TRUE TRUE TRUE 1 29 3%
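The count tables above can be condensed into stratum-specific odds ratios to read the interaction directly. A sketch, shown for eloquence; `fisher.test` is used here for exact small-cell confidence intervals:

```r
# Stratum-specific OR of poor outcome by eloquence, within hemorrhage strata
# (illustrative; one exact 2x2 test per hemorrhage stratum)
df %>%
  drop_na(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
  group_by(has_hemorrhage) %>%
  group_map(~ fisher.test(table(
    .x$is_eloquent_location,
    .x[[CHOSEN_OUTCOME]]
  )))
```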

Selective inference

Read the following guides:

Setup

Define unwanted columns.

unwanted_all <- c(
  # Use values at or before first treatment
  "modified_rankin_score_postop_within_1_week",
  "modified_rankin_score_final",
  # Use the split between high vs. low grade instead of the granular
  "spetzler_martin_grade"
)

# Remove hemorrhage from the mix
if (str_detect(params$TITLE, "Hemorrhage")) {
  unwanted_all <- c(unwanted_all, "has_hemorrhage")
}

unwanted_without_gradings <- c(
  unwanted_all,
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4",
  "size_score", # Already covered by the more detailed max_size_cm
  "venous_drainage", # Already covered by has_deep_venous_drainage
  "location", # Already covered to some extent by eloquence
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)

# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
  "size_score",
  "max_size_cm",
  "is_eloquent_location",
  # "location",  # Include this as not highly correlated with SM
  "has_deep_venous_drainage",
  "venous_drainage",
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # "is_diffuse_nidus",
  # "has_hemorrhage",
  # Use values at or before first treatment
  "modified_rankin_score_final",
  "modified_rankin_score_postop_within_1_week",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation"
  # "modified_rankin_score_pretreatment"
)

Create dataset.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  CHOSEN_OUTCOME
))

# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]

# Create df of interest
df_all <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_all)) %>%
  drop_na()

df_with_gradings <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_with_gradings)) %>%
  drop_na()

df_no_scores <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_without_gradings)) %>%
  drop_na()

# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, " ~ . - 1"))

# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Center X for the stepwise selective-inference routines (fs/fsInf)
# (scale = FALSE: glmnet standardizes internally, so only centering is needed)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)

# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
 [1] "age_at_first_treatment_yrs"         "modified_rankin_score_pretreatment"
 [3] "max_size_cm"                        "is_maleFALSE"                      
 [5] "is_maleTRUE"                        "is_eloquent_locationTRUE"          
 [7] "has_deep_venous_drainageTRUE"       "is_diffuse_nidusTRUE"              
 [9] "has_hemorrhageTRUE"                 "has_seizuresTRUE"                  
[11] "has_deficitTRUE"                    "has_paresisTRUE"                   
[13] "has_associated_aneurysmTRUE"        "is_surgeryTRUE"                    
[15] "procedure_combinationsER"           "procedure_combinationsERS"         
[17] "procedure_combinationsES"           "procedure_combinationsR"           
[19] "procedure_combinationsRS"           "procedure_combinationsS"           
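Note that `~ . - 1` keeps every level of the first factor (both `is_maleFALSE` and `is_maleTRUE` appear above), leaving the design collinear once downstream routines add an intercept back. A full-rank alternative, sketched here, is to build the matrix with an intercept so each factor drops its reference level, then remove the intercept column:

```r
# Full-rank design: keep the intercept during construction so factors use
# treatment contrasts, then drop the intercept column itself
frla_full_rank <- as.formula(paste(CHOSEN_OUTCOME, "~ ."))
X_full_rank <- model.matrix(frla_full_rank, df_no_scores)[, -1, drop = FALSE]
```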

Stepwise

Run forward stepwise selection with selective inference. Note that fs() fits a Gaussian linear model, used here as an approximation for the binary outcome.

# Set seed
set.seed(33)

# Run forward step-wise
fsfit <- fs(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  maxsteps = 2000,
  intercept = TRUE,
  normalize = FALSE
)

# Estimate sigma
sigmahat <- estimateSigma(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  intercept = TRUE,
  standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")

Standard deviation of noise (specified or estimated) sigma = 0.304

Sequential testing results with alpha = 0.100
 Step Var        Coef Z-score P-value   LowConfPt   UpConfPt LowTailArea UpTailArea
    1   2  5.9000e-02   4.153   0.047  1.0000e-03 8.1000e-02        0.05      0.049
    2   3  3.0000e-02   3.371   0.386 -7.0000e-02 5.6000e-02        0.05      0.049
    3   6  8.3000e-02   1.855   0.856 -2.4860e+00 1.8200e-01        0.05      0.050
    4   1 -1.0000e-02  -1.872   0.107 -3.0900e-01 2.0000e-02        0.05      0.050
    5   8  9.8000e-02   1.430   0.308 -1.2650e+00 2.9840e+00        0.05      0.050
    6  12 -7.3000e-02  -1.394   0.346 -1.1230e+00 6.1100e-01        0.05      0.050
    7   9 -5.7000e-02  -1.210   0.588 -6.3500e-01 1.0400e+00        0.05      0.050
    8  11  6.2000e-02   0.953   0.332 -8.7300e-01 1.8070e+00        0.05      0.050
    9  10 -3.4000e-02  -0.698   0.642 -7.7400e-01 1.5730e+00        0.05      0.050
   10  17 -3.2000e-02  -0.554   0.783 -1.6170e+00        Inf        0.05      0.000
   11  18 -3.1000e-02  -0.534   0.217 -3.1600e+00 6.9400e-01        0.05      0.050
   12  13 -1.9000e-02  -0.350   0.822 -7.0300e-01 4.8070e+00        0.05      0.050
   13  19 -2.6000e-02  -0.334   0.642 -5.0580e+00        Inf        0.05      0.000
   14  15 -1.9000e-02  -0.308   0.823 -3.3530e+00        Inf        0.05      0.000
   15  14 -5.6000e-02  -0.389   0.630 -3.5200e+00 6.5420e+00        0.05      0.050
   16   4 -1.1000e-02  -0.252   0.370        -Inf        Inf        0.00      0.000
   17   5 -6.9521e+13  -0.682   0.406        -Inf        Inf        0.00      0.000
   18   7  1.1000e-02   0.237   0.132 -9.6100e-01        Inf        0.05      0.000
   19  20  1.3000e-02   0.145   0.155 -1.1810e+00        Inf        0.05      0.000
   20  16  1.8444e+13   0.733   0.426 -1.6398e+15 2.2646e+15        0.05      0.050

Estimated stopping point from ForwardStop rule = 1
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")

Standard deviation of noise (specified or estimated) sigma = 0.304

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   2  0.052   3.622   0.060    -0.004    0.161        0.05       0.05
   3  0.017   1.577   0.162    -0.124    0.907        0.05       0.05
   6  0.078   1.738   0.676    -2.397    0.949        0.05       0.05
   1 -0.009  -1.750   0.120    -0.309    0.026        0.05       0.05
   8  0.098   1.430   0.806      -Inf    2.127        0.00       0.05

Estimated stopping point from AIC rule = 5
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix

Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")

Standard deviation of noise (specified or estimated) sigma = 0.304

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   2  0.052   3.622   0.073    -0.008    0.077       0.049       0.05
   3  0.017   1.577   0.474    -0.181    0.181       0.050       0.05
   6  0.078   1.738   0.854    -2.398    0.183       0.050       0.05
   1 -0.009  -1.750   0.120    -0.309    0.026       0.050       0.05
   8  0.098   1.430   0.308    -1.265    2.984       0.050       0.05
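The `Var` column in the fsInf output indexes columns of the design matrix; the `colnames` printed earlier can be used to map the indices back to predictor names. A small helper sketch:

```r
# Translate fsInf's numeric Var indices back to predictor names
# (out.aic$vars and out.aic$pv are components of the fsInf result)
data.frame(
  step = seq_along(out.aic$vars),
  var = colnames(X_without_gradings_scaled)[out.aic$vars],
  pvalue = round(out.aic$pv, 3)
)
```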

LASSO - all

Fit a LASSO logistic regression using all available variables.

# Set seed
set.seed(141845)

# Define values
X <- X_all
y <- y_all

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y) # y is a vector, so length(), not nrow()
beta <- coef(
  x = X,
  y = y,
  object = gfit,
  s = lambda,
  exact = TRUE
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
modified_rankin_score_pretreatment 1.60197 0.00324 1.23055 2.0246
is_eloquent_locationTRUE 2.84943 0.17871 0.40110 13.7947
age_at_first_treatment_yrs 0.91395 0.34933 0.83718 1.2227
locationHemispheric 0.50580 0.36155 0.19799 5.2418
max_size_cm 1.10608 0.43234 0.55363 1.9074
is_diffuse_nidusTRUE 1.63474 0.48580 0.06877 6.9297
is_spetzler_martin_grade_less_than_4TRUE 0.77613 0.65876 0.11416 235.5170
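`best_lambda_1se` is extracted above but never used; comparing how many coefficients survive at each lambda shows what the sparser 1-SE choice would cost. A sketch:

```r
# Compare the size of the selected model at lambda.min vs lambda.1se
n_nonzero <- function(lam) {
  sum(coef(gfit, s = lam)[-1, 1] != 0) # drop the intercept row
}
c(min = n_nonzero(best_lambda_min), `1se` = n_nonzero(best_lambda_1se))
```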

LASSO - without scores

Fit a LASSO logistic regression on the component variables, excluding the composite grading scores.

# Set seed
set.seed(141845)

# Define values
X <- X_without_gradings
y <- y_without_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y) # y is a vector, so length(), not nrow()
beta <- coef(
  x = X,
  y = y,
  object = gfit,
  s = lambda,
  exact = TRUE
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
modified_rankin_score_pretreatment 1.61354 0.00166 1.26101 2.0279
is_eloquent_locationTRUE 3.46368 0.07949 0.78161 8.3229
age_at_first_treatment_yrs 0.91215 0.31697 0.83690 1.1935
max_size_cm 1.11031 0.36610 0.74554 1.3869
is_diffuse_nidusTRUE 1.89468 0.38282 0.13844 7.8960

LASSO - without components

Use LASSO logistic regression based on previous gradings.

# Set seed
set.seed(141845)

# Define values
X <- X_with_gradings
y <- y_with_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda
# (given the small number of observations, hence large SD, use min lambda)
lambda <- best_lambda_min
n <- nrow(X)
beta <- coef(
  object = gfit,
  x = X,
  y = y,
  s = lambda,
  exact = TRUE
)

# Estimate sigma (used by fixedLassoInf only for the gaussian family)
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print the selected model (handle the case where no variable is selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
modified_rankin_score_pretreatment 1.64446 0.01805 1.13593 2.6366e+00
spetzler_martin_grade 1.62166 0.04841 1.00690 1.2074e+01
age_at_first_treatment_yrs 0.90386 0.29375 0.82804 1.1854e+00
is_diffuse_nidusTRUE 1.71819 0.44392 0.09088 8.1847e+00
locationHemispheric 0.67553 0.45700 0.00000 1.1947e+06
locationDeep 1.17303 0.66810 0.00000 4.3400e+04

Multivariable - Without scores

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Placeholder for adjustment variables (none added at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "is_eloquent_location"              
[3] "is_diffuse_nidus"                   "max_size_cm"                       
[5] "has_deficit"                        "has_deep_venous_drainage"          
[7] "age_at_first_treatment_yrs"        

Model selection

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.69 0.16 3.37486 0.001 1.25 2.31
is_eloquent_locationTRUE 4.33 0.71 2.07616 0.038 1.22 20.92
age_at_first_treatment_yrs 0.90 0.06 -1.66660 0.096 0.80 1.02
max_size_cm 1.11 0.10 1.06917 0.285 0.92 1.35
is_diffuse_nidusTRUE 1.85 0.64 0.96196 0.336 0.51 6.43
has_deep_venous_drainageTRUE 1.21 0.52 0.36905 0.712 0.44 3.44
has_deficitTRUE 0.95 0.49 -0.09355 0.925 0.36 2.51
(Intercept) 0.02 1.05 -3.60307 0.000 0.00 0.15

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.69 0.15 3.53064 0.000 1.26 2.26
is_eloquent_locationTRUE 4.33 0.61 2.40662 0.016 1.31 14.26
age_at_first_treatment_yrs 0.90 0.06 -1.55897 0.119 0.80 1.03
max_size_cm 1.11 0.10 1.07003 0.285 0.92 1.34
is_diffuse_nidusTRUE 1.85 0.62 0.99855 0.318 0.55 6.17
has_deep_venous_drainageTRUE 1.21 0.51 0.37804 0.705 0.45 3.26
has_deficitTRUE 0.95 0.48 -0.09700 0.923 0.38 2.42
(Intercept) 0.02 1.25 -3.03296 0.002 0.00 0.26

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 225 (26 with outcome)
  Fitted examples = 224 (26 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
      llh   llhNull        G2  McFadden      r2ML      r2CU 
-63.85680 -80.42133  33.12907   0.20597   0.13748   0.26836 
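As an arithmetic check, McFadden's pseudo-R-squared is one minus the ratio of the fitted to the null log-likelihood; plugging in the values from the pR2 output above reproduces the reported numbers:

```r
# Log-likelihoods taken from the pscl::pR2 output above
llh <- -63.85680       # fitted model
llh_null <- -80.42133  # intercept-only (null) model

# McFadden's pseudo R-squared
mcfadden <- 1 - llh / llh_null

# Likelihood-ratio statistic G2
g2 <- 2 * (llh - llh_null)

round(c(McFadden = mcfadden, G2 = g2), 5)  # McFadden = 0.20597, G2 = 33.12906
```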

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.801 (SE, 0.041; 95% CI, 0.721-0.882)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.80148 0.80148
m1 1 PRC 0.44948 0.44948

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.801

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic (Landis & Koch benchmarks): ≤ 0 = no agreement; 0.00-0.20 = slight agreement; 0.21-0.40 = fair agreement; 0.41-0.60 = moderate agreement; 0.61-0.80 = substantial agreement; 0.81-1.00 = near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome class is the largest and the NIR is 23/35 = 0.657. The further the accuracy exceeds the NIR, the more confident we can be that the model is doing more than simply predicting the largest class.
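Both quantities can be recomputed by hand; this minimal sketch uses the counts from the confusion matrix printed below (6, 2, 20, 196) and reproduces the caret values:

```r
# Counts from the confusion matrix below (rows = predicted, cols = observed)
tp <- 6; fp <- 2; fn <- 20; tn <- 196
n <- tp + fp + fn + tn

# NIR: proportion of the largest observed class
nir <- max(tp + fn, fp + tn) / n

# Cohen's kappa: agreement beyond what class frequencies alone would give
accuracy <- (tp + tn) / n
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(nir = nir, kappa = kappa), 3)  # nir = 0.884, kappa = 0.316
```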

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event        6        2
  No Event    20      196
                                        
               Accuracy : 0.902         
                 95% CI : (0.855, 0.937)
    No Information Rate : 0.884         
    P-Value [Acc > NIR] : 0.23653       
                                        
                  Kappa : 0.316         
                                        
 Mcnemar's Test P-Value : 0.00029       
                                        
            Sensitivity : 0.2308        
            Specificity : 0.9899        
         Pos Pred Value : 0.7500        
         Neg Pred Value : 0.9074        
             Prevalence : 0.1161        
         Detection Rate : 0.0268        
   Detection Prevalence : 0.0357        
      Balanced Accuracy : 0.6103        
                                        
       'Positive' Class : Event         
                                        

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_col(alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = "none") +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )


Multivariable - Without components

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_unadjusted_recoded %>%
  bind_rows(univariable_adjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Placeholder for adjustment variables (none added at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "is_diffuse_nidus"                  
[3] "spetzler_martin_grade"              "has_deficit"                       
[5] "age_at_first_treatment_yrs"         "location"                          

Model selection

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

fit_with_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.71 0.16 3.34710 0.001 1.25 2.36
spetzler_martin_grade 1.70 0.25 2.10863 0.035 1.05 2.83
age_at_first_treatment_yrs 0.89 0.06 -1.79954 0.072 0.79 1.01
is_diffuse_nidusTRUE 1.73 0.60 0.91123 0.362 0.52 5.63
locationHemispheric 0.61 0.73 -0.67058 0.502 0.15 2.87
locationDeep 1.14 0.79 0.16175 0.872 0.26 5.93
has_deficitTRUE 1.06 0.52 0.10399 0.917 0.38 2.91
(Intercept) 0.03 1.23 -2.93775 0.003 0.00 0.26

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.71 0.15 3.58302 0.000 1.27 2.29
spetzler_martin_grade 1.70 0.28 1.87378 0.061 0.98 2.96
age_at_first_treatment_yrs 0.89 0.06 -1.73772 0.082 0.79 1.01
is_diffuse_nidusTRUE 1.73 0.58 0.94573 0.344 0.56 5.39
locationHemispheric 0.61 0.69 -0.70868 0.479 0.16 2.38
locationDeep 1.14 0.83 0.15294 0.878 0.22 5.80
has_deficitTRUE 1.06 0.54 0.09973 0.921 0.37 3.03
(Intercept) 0.03 1.45 -2.49016 0.013 0.00 0.46

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 228 (26 with outcome)
  Fitted examples = 224 (26 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
      llh   llhNull        G2  McFadden      r2ML      r2CU 
-63.56292 -80.42133  33.71682   0.20963   0.13974   0.27277 

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.807 (SE, 0.043; 95% CI, 0.722-0.891)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.80653 0.80653
m1 1 PRC 0.47167 0.47167

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.807

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic (Landis & Koch benchmarks): ≤ 0 = no agreement; 0.00-0.20 = slight agreement; 0.21-0.40 = fair agreement; 0.41-0.60 = moderate agreement; 0.61-0.80 = substantial agreement; 0.81-1.00 = near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome class is the largest and the NIR is 23/35 = 0.657. The further the accuracy exceeds the NIR, the more confident we can be that the model is doing more than simply predicting the largest class.

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event        6        1
  No Event    20      197
                                       
               Accuracy : 0.906        
                 95% CI : (0.86, 0.941)
    No Information Rate : 0.884        
    P-Value [Acc > NIR] : 0.175        
                                       
                  Kappa : 0.331        
                                       
 Mcnemar's Test P-Value : 8.57e-05     
                                       
            Sensitivity : 0.2308       
            Specificity : 0.9949       
         Pos Pred Value : 0.8571       
         Neg Pred Value : 0.9078       
             Prevalence : 0.1161       
         Detection Rate : 0.0268       
   Detection Prevalence : 0.0312       
      Balanced Accuracy : 0.6129       
                                       
       'Positive' Class : Event        
                                       

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_col(alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = "none") +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )


Multivariable - With interactions

Presents with hemorrhage

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  "modified_rankin_score_pretreatment*has_hemorrhage +
  spetzler_martin_grade*has_hemorrhage +
  age_at_first_treatment_yrs*has_hemorrhage"
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 2.39 0.39 2.25045 0.024 1.14 5.36
spetzler_martin_grade 1.88 0.35 1.80269 0.071 1.00 4.05
has_hemorrhageTRUE:age_at_first_treatment_yrs 0.84 0.13 -1.37268 0.170 0.65 1.07
modified_rankin_score_pretreatment:has_hemorrhageTRUE 0.76 0.44 -0.63679 0.524 0.31 1.76
has_hemorrhageTRUE 3.94 2.36 0.58165 0.561 0.05 571.97
age_at_first_treatment_yrs 0.98 0.08 -0.26729 0.789 0.83 1.17
has_hemorrhageTRUE:spetzler_martin_grade 1.10 0.45 0.21443 0.830 0.44 2.62
(Intercept) 0.01 1.91 -2.68201 0.007 0.00 0.16

Model comparison

Likelihood ratio

The grading-based and grading-independent models have the same residual degrees of freedom, so the likelihood-ratio test is degenerate (Df = 0 and no p-value is produced); the deviance difference is small (0.59), giving no evidence that the grading-based model fits the data better. Because the models are not nested, an information criterion such as AIC is the more appropriate comparison.

anova(fit_without_grading, fit_with_grading, test = "LRT")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
216 127.71
216 127.13 0 0.58775
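The two fits have equal residual degrees of freedom, so the LRT above is degenerate; for non-nested models an information criterion is a common alternative. A minimal self-contained sketch of the idea (toy data; `fit_a`, `fit_b`, and predictors `a`, `b` are hypothetical):

```r
# Toy illustration: comparing two non-nested logistic models by AIC
set.seed(1)
d <- data.frame(
  y = rbinom(100, 1, 0.3),
  a = rnorm(100),
  b = rnorm(100)
)
fit_a <- glm(y ~ a, data = d, family = binomial())
fit_b <- glm(y ~ b, data = d, family = binomial())

# Lower AIC indicates a better fit-complexity trade-off
AIC(fit_a, fit_b)
```

For the models above, the analogous call would be `AIC(fit_without_grading, fit_with_grading)`.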

Special cases

SM < 4

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Placeholder for adjustment variables (none added at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment"   "is_diffuse_nidus"                    
[3] "has_deficit"                          "age_at_first_treatment_yrs"          
[5] "is_spetzler_martin_grade_less_than_4"

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.67 0.15 3.37539 0.001 1.24 2.27
is_spetzler_martin_grade_less_than_4TRUE 0.33 0.54 -2.08117 0.037 0.11 0.94
age_at_first_treatment_yrs 0.92 0.06 -1.38754 0.165 0.82 1.04
is_diffuse_nidusTRUE 2.21 0.58 1.37945 0.168 0.71 6.87
has_deficitTRUE 1.21 0.48 0.39042 0.696 0.47 3.10
(Intercept) 0.16 0.83 -2.18227 0.029 0.03 0.78

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_pretreatment 1.67 0.14 3.6491 0.000 1.27 2.20
is_spetzler_martin_grade_less_than_4TRUE 0.33 0.56 -1.9983 0.046 0.11 0.98
is_diffuse_nidusTRUE 2.21 0.55 1.4404 0.150 0.75 6.51
age_at_first_treatment_yrs 0.92 0.06 -1.2769 0.202 0.81 1.05
has_deficitTRUE 1.21 0.48 0.3903 0.696 0.47 3.08
(Intercept) 0.16 0.83 -2.1884 0.029 0.03 0.83

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Had surgery

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Placeholder for adjustment variables (none added at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace the procedure-combinations variable by the binary surgery indicator
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]

# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "spetzler_martin_grade"             
[3] "is_diffuse_nidus"                   "has_deficit"                       
[5] "age_at_first_treatment_yrs"         "is_surgery"                        

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3)
  )

# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.70 | 0.16 | 3.33870 | 0.001 | 1.25 | 2.35 |
| spetzler_martin_grade | 1.96 | 0.27 | 2.53418 | 0.011 | 1.19 | 3.39 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.73441 | 0.083 | 0.80 | 1.01 |
| is_diffuse_nidusTRUE | 1.60 | 0.60 | 0.77963 | 0.436 | 0.48 | 5.24 |
| is_surgeryTRUE | 1.51 | 0.69 | 0.59102 | 0.555 | 0.37 | 5.78 |
| has_deficitTRUE | 1.12 | 0.49 | 0.24000 | 0.810 | 0.43 | 2.94 |
| (Intercept) | 0.01 | 1.24 | -3.57311 | 0.000 | 0.00 | 0.12 |

Create robust CIs using the sandwich estimator.

# Calculate the robust (HC0) variance-covariance matrix
robust_vcov <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust z-tests and confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_vcov)
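
For intuition, the robust z-scores reported by `coeftest()` are simply the coefficients divided by the square roots of the diagonal of the HC0 matrix. A toy sketch with simulated data (not the study data):

```r
# Toy illustration (simulated data): coeftest() z-values equal
# coef / sqrt(diag(robust vcov)).
set.seed(1)
toy <- data.frame(y = rbinom(100, 1, 0.3), x = rnorm(100))
toy_fit <- glm(y ~ x, data = toy, family = binomial())

vc <- sandwich::vcovHC(toy_fit, type = "HC0")
z_manual <- coef(toy_fit) / sqrt(diag(vc))
z_reported <- lmtest::coeftest(toy_fit, vcov. = vc)[, "z value"]
all.equal(unname(z_manual), unname(z_reported))
```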

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3)
  )

# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.70 | 0.15 | 3.65408 | 0.000 | 1.28 | 2.27 |
| spetzler_martin_grade | 1.96 | 0.30 | 2.21816 | 0.027 | 1.08 | 3.56 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.65532 | 0.098 | 0.80 | 1.02 |
| is_diffuse_nidusTRUE | 1.60 | 0.59 | 0.79429 | 0.427 | 0.50 | 5.13 |
| is_surgeryTRUE | 1.51 | 0.73 | 0.55889 | 0.576 | 0.36 | 6.35 |
| has_deficitTRUE | 1.12 | 0.49 | 0.24017 | 0.810 | 0.43 | 2.92 |
| (Intercept) | 0.01 | 1.59 | -2.78434 | 0.005 | 0.00 | 0.27 |

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha = 0.5),
#   show_intercept = FALSE
# )

Grade 5s

df_uni %>%
  drop_na(spetzler_martin_grade) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 1 | 38 | 17% |
| FALSE | 2 | 45 | 20% |
| FALSE | 3 | 65 | 29% |
| FALSE | 4 | 37 | 16% |
| FALSE | 5 | 17 | 7% |
| TRUE | 1 | 2 | 1% |
| TRUE | 2 | 3 | 1% |
| TRUE | 3 | 6 | 3% |
| TRUE | 4 | 7 | 3% |
| TRUE | 5 | 8 | 4% |

Only consider those with SM grade 5.

df_uni %>%
  filter(spetzler_martin_grade == 5) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 5 | 17 | 68% |
| TRUE | 5 | 8 | 32% |
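
For the grade-5 subgroup alone, an exact binomial interval gives a sense of the uncertainty around the observed 8/25 (32%) poor-outcome proportion. A base-R sketch, with the counts taken from the table above:

```r
# Clopper-Pearson exact CI for 8 poor outcomes among 25 SM grade-5 patients
bt <- binom.test(x = 8, n = 25)
unname(bt$estimate)       # 0.32
round(bt$conf.int, 2)     # exact 95% CI for the proportion
```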

Write

Setup

Create the necessary directories.

# Get today's date
today <- format(Sys.Date(), "%Y-%m-%d")

# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")

# Create folders (recursive = TRUE also creates the parent directories;
# showWarnings = FALSE silences the "already exists" warnings)
for (folder in c(csv_folder, pdf_folder, html_folder)) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/poor-outcomes"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-30"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-30/csv"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-30/pdf"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-30/html"

Figures

Write all figures.

# Save
# ggsave(
#   file.path(pdf_folder, "histogram_num_days.pdf"),
#   plot = plots$histogram_num_days,
#   width = 6,
#   height = 6
# )
# # Start graphic device
# pdf(
#   file = file.path(pdf_folder, "forest-plot_frequentist.pdf"),
#   width = 10,
#   height = 15
# )
#
# # Plot
# plots$forest_plot_frequentist
#
# # Shut down device
# dev.off()

Tables

Write all tables.

# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )
#
# # Save as HTML
# # - Save pdf does not work with webshot
# # - It works with pagedown but not as pretty as desired
# df %>%
#   sable() %>%
#   kableExtra::save_kable(file = filepath_html)
write_csv(
  univariable_unadjusted,
  file.path(csv_folder, "univariable_unadjusted.csv")
)
write_csv(
  univariable_adjusted,
  file.path(csv_folder, "univariable_adjusted.csv")
)
write_csv(
  multivariable_pvalue,
  file.path(csv_folder, "multivariable_pvalue.csv")
)

Data

Write all data.

# # Arguments
# df <- study
# filename <- paste0(ANALYSIS_NAME, "_", today, ".csv")
# filepath_csv <- file.path(DST_BUCKET, dst_dirname_data, filename)
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )

Reproducibility

Linting and styling

# Style current file
styler::style_file(
  path = rstudioapi::getSourceEditorContext()$path,
  style = styler::tidyverse_style
)

# Lint current file
lintr::lint(rstudioapi::getSourceEditorContext()$path)

Dependency management

# Clean up project of libraries not in use
# (use prompt = FALSE to avoid the interactive session)
# (packages can only be removed in interactive mode b/c this is destructive)
renv::clean(prompt = TRUE)

# Update lock file with new packages
renv::snapshot()

Containerization

# Only run this if option is set to TRUE
if (UPDATE_DOCKERFILE) {
  # Create a dockerfile from the session info
  my_dockerfile <- containerit::dockerfile(from = sessionInfo(), env = ls())
  # Write file
  write(my_dockerfile, file = "~/Dockerfile")
  # Print
  print(my_dockerfile)
}

Documentation

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0

attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base

other attached packages:
[1] rmarkdown_2.25 table1_1.4.3 shiny_1.8.0
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
[13] selectiveInference_1.2.5 MASS_7.3-60 adaptMCMC_1.5
[16] coda_0.19-4.1 survival_3.5-7 intervals_0.15.4
[19] glmnet_4.1-8 Matrix_1.6-1.1 magrittr_2.0.3
[22] ggplot2_3.5.0 kableExtra_1.4.0 rmdformats_1.0.4
[25] knitr_1.45

loaded via a namespace (and not attached):
[1] splines_4.3.2 later_1.3.2 versions_0.3
[4] R.oo_1.26.0 hardhat_1.4.0 pROC_1.18.5
[7] rpart_4.1.21 rex_1.2.1 lifecycle_1.0.4
[10] tcltk_4.3.2 globals_0.16.3 processx_3.8.3
[13] lattice_0.21-9 vroom_1.6.5 backports_1.4.1
[16] sass_0.4.8 jquerylib_0.1.4 yaml_2.3.8
[19] remotes_2.4.2.1 httpuv_1.6.14 summarytools_1.0.1
[22] pkgload_1.3.4 R.cache_0.16.0 R.utils_2.12.3
[25] styler_1.10.2 nnet_7.3-19 sandwich_3.1-0
[28] ipred_0.9-14 lava_1.8.0 listenv_0.9.1
[31] parallelly_1.37.1 svglite_2.1.3 codetools_0.2-19
[34] xml2_1.3.6 tidyselect_1.2.1 precrec_0.14.4
[37] shape_1.4.6.1 futile.logger_1.4.3 farver_2.1.1
[40] matrixStats_1.3.0 stats4_4.3.2 base64enc_0.1-3
[43] jsonlite_1.8.8 caret_6.0-94 e1071_1.7-14
[46] ellipsis_0.3.2 Formula_1.2-5 iterators_1.0.14
[49] systemfonts_1.0.5 foreach_1.5.2 tools_4.3.2
[52] pryr_0.1.6 Rcpp_1.0.12 glue_1.7.0
[55] gridExtra_2.3 prodlim_2023.08.28 xfun_0.42
[58] withr_3.0.0 formatR_1.14 BiocManager_1.30.22
[61] fastmap_1.1.1 fansi_1.0.6 callr_3.7.5
[64] digest_0.6.34 lintr_3.1.1 timechange_0.3.0
[67] R6_2.5.1 mime_0.12 colorspace_2.1-0
[70] R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
[73] renv_1.0.4 data.table_1.15.4 recipes_1.0.10
[76] class_7.3-22 stevedore_0.9.6 ModelMetrics_1.2.2.2
[79] pkgconfig_2.0.3 gtable_0.3.4 timeDate_4032.109
[82] lmtest_0.9-40 containerit_0.6.0.9004 htmltools_0.5.7
[85] bookdown_0.39 scales_1.3.0 cyclocomp_1.1.1
[88] gower_1.0.1 lambda.r_1.2.4 rstudioapi_0.15.0
[91] tzdb_0.4.0 reshape2_1.4.4 checkmate_2.3.1
[94] nlme_3.1-163 curl_5.2.1 ggcorrplot_0.1.4.1
[97] proxy_0.4-27 zoo_1.8-12 cachem_1.0.8
[100] miniUI_0.1.1.1 desc_1.4.3 pillar_1.9.0
[103] grid_4.3.2 vctrs_0.6.5 pscl_1.5.9
[106] promises_1.2.1 shinyFiles_0.9.3 xtable_1.8-4
[109] evaluate_0.23 cvAUC_1.1.4 magick_2.8.3
[112] cli_3.6.2 compiler_4.3.2 futile.options_1.0.1
[115] rlang_1.1.3 crayon_1.5.2 future.apply_1.11.2
[118] labeling_0.4.3 ps_1.7.6 plyr_1.8.9
[121] fs_1.6.3 stringi_1.8.3 pander_0.6.5
[124] viridisLite_0.4.2 assertthat_0.2.1 munsell_0.5.0
[127] lazyeval_0.2.2 rapportools_1.1 hms_1.1.3
[130] bit64_4.0.5 future_1.33.2 highr_0.10
[133] ROCR_1.0-11 semver_0.2.0 broom_1.0.6
[136] memoise_2.0.1 bslib_0.6.1 bit_4.0.5

References

[[1]]
Scheidegger A (2024). _adaptMCMC: Implementation of a Generic Adaptive Monte
Carlo Markov Chain Sampler_. R package version 1.5,
<https://CRAN.R-project.org/package=adaptMCMC>.

[[2]]
R Core Team (2023). _R: A Language and Environment for Statistical Computing_.
R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.

[[3]]
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence Diagnosis and
Output Analysis for MCMC." _R News_, *6*(1), 7-11.
<https://journal.r-project.org/archive/>.

[[5]]
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar
of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.

[[6]]
Wickham H (2023). _forcats: Tools for Working with Categorical Variables
(Factors)_. R package version 1.0.0,
<https://CRAN.R-project.org/package=forcats>.

[[7]]
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.

[[8]]
Friedman J, Tibshirani R, Hastie T (2010). "Regularization Paths for
Generalized Linear Models via Coordinate Descent." _Journal of Statistical
Software_, *33*(1), 1-22. doi:10.18637/jss.v033.i01
<https://doi.org/10.18637/jss.v033.i01>.

Simon N, Friedman J, Tibshirani R, Hastie T (2011). "Regularization Paths for
Cox's Proportional Hazards Model via Coordinate Descent." _Journal of
Statistical Software_, *39*(5), 1-13. doi:10.18637/jss.v039.i05
<https://doi.org/10.18637/jss.v039.i05>.

Tay JK, Narasimhan B, Hastie T (2023). "Elastic Net Regularization Paths for
All Generalized Linear Models." _Journal of Statistical Software_, *106*(1),
1-31. doi:10.18637/jss.v106.i01 <https://doi.org/10.18637/jss.v106.i01>.

[[11]]
Bourgon R (2023). _intervals: Tools for Working with Points and Intervals_. R
package version 0.15.4, <https://CRAN.R-project.org/package=intervals>.

[[12]]
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and Pipe
Syntax_. R package version 1.4.0,
<https://CRAN.R-project.org/package=kableExtra>.

[[13]]
Xie Y (2023). _knitr: A General-Purpose Package for Dynamic Report Generation
in R_. R package version 1.45, <https://yihui.org/knitr/>.

Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition. Chapman and
Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <https://yihui.org/knitr/>.

Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in R." In
Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible Computational
Research_. Chapman and Hall/CRC. ISBN 978-1466561595.

[[14]]
Grolemund G, Wickham H (2011). "Dates and Times Made Easy with lubridate."
_Journal of Statistical Software_, *40*(3), 1-25.
<https://www.jstatsoft.org/v40/i03/>.

[[15]]
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R package
version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.

[[16]]
Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_, Fourth
edition. Springer, New York. ISBN 0-387-95457-0,
<https://www.stats.ox.ac.uk/pub/MASS4/>.

[[17]]
Bates D, Maechler M, Jagan M (2023). _Matrix: Sparse and Dense Matrix Classes
and Methods_. R package version 1.6-1.1,
<https://CRAN.R-project.org/package=Matrix>.

[[20]]
Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R package
version 1.0.2, <https://CRAN.R-project.org/package=purrr>.

[[21]]
Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R
package version 2.1.5, <https://CRAN.R-project.org/package=readr>.

[[22]]
Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A,
Wickham H, Cheng J, Chang W, Iannone R (2023). _rmarkdown: Dynamic Documents
for R_. R package version 2.25, <https://github.com/rstudio/rmarkdown>.

Xie Y, Allaire J, Grolemund G (2018). _R Markdown: The Definitive Guide_.
Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338,
<https://bookdown.org/yihui/rmarkdown>.

Xie Y, Dervieux C, Riederer E (2020). _R Markdown Cookbook_. Chapman and
Hall/CRC, Boca Raton, Florida. ISBN 9780367563837,
<https://bookdown.org/yihui/rmarkdown-cookbook>.

[[23]]
Barnier J (2022). _rmdformats: HTML Output Formats and Templates for
'rmarkdown' Documents_. R package version 1.0.4,
<https://CRAN.R-project.org/package=rmdformats>.

[[24]]
Tibshirani R, Tibshirani R, Taylor J, Loftus J, Reid S, Markovic J (2019).
_selectiveInference: Tools for Post-Selection Inference_. R package version
1.2.5, <https://CRAN.R-project.org/package=selectiveInference>.

[[25]]
Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, McPherson
J, Dipert A, Borges B (2023). _shiny: Web Application Framework for R_. R
package version 1.8.0, <https://CRAN.R-project.org/package=shiny>.

[[27]]
Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String
Operations_. R package version 1.5.1,
<https://CRAN.R-project.org/package=stringr>.

[[28]]
Therneau T (2023). _A Package for Survival Analysis in R_. R package version
3.5-7, <https://CRAN.R-project.org/package=survival>.

Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival Data:
Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.

[[29]]
Rich B (2023). _table1: Tables of Descriptive Statistics in HTML_. R package
version 1.4.3, <https://CRAN.R-project.org/package=table1>.

[[30]]
Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package version
3.2.1, <https://CRAN.R-project.org/package=tibble>.

[[31]]
Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R package
version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.

[[32]]
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G,
Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K,
Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K,
Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_,
*4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
